[SPARK-38215][SQL] InsertIntoHiveDir should use data source if it's convertible #35528
Conversation
Gentle ping @cloud-fan @viirya Could you take a look? It's a useful feature.
Also ping @dongjoon-hyun @HyukjinKwon |
```scala
private def convertProvider(storage: CatalogStorageFormat): String = {
  val serde = storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
  Some("parquet").filter(serde.contains).getOrElse("orc")
}
```
nit: `if (serde.contains("parquet")) "parquet" else "orc"` is much simpler.
`if (serde.contains("parquet")) "parquet" else "orc"`

Updated.
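For context, a minimal self-contained sketch of the simplified helper after the nit was applied. The `StorageFormat` case class here is a stand-in for Spark's `CatalogStorageFormat`, introduced only so the sketch runs on its own; the SerDe class names are the real Hive Parquet/ORC SerDes.

```scala
import java.util.Locale

object ConvertProviderSketch {
  // Stand-in for Spark's CatalogStorageFormat, which carries the Hive SerDe
  // class name; used here for illustration only.
  case class StorageFormat(serde: Option[String])

  // Picks the data source provider from the SerDe class name. The conversion
  // rule only fires for Parquet/ORC SerDes, so anything that isn't Parquet
  // must be ORC.
  def convertProvider(storage: StorageFormat): String = {
    val serde = storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    if (serde.contains("parquet")) "parquet" else "orc"
  }

  def main(args: Array[String]): Unit = {
    println(convertProvider(StorageFormat(
      Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")))) // parquet
    println(convertProvider(StorageFormat(
      Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde")))) // orc
  }
}
```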
```scala
 * - When writing to partitioned Hive-serde Parquet/ORC tables when
 *   `spark.sql.hive.convertInsertingPartitionedTable` is true
 * - When writing to directory with Hive-serde
 * - When writing to non-partitioned Hive-serde Parquet/ORC tables using CTAS
```
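As an illustrative sketch of the CTAS case listed in the comment above (table name and data are made up; assumes the `spark.sql.hive.convertMetastore*` conversion configs are left at their defaults):

```scala
// Sketch: a non-partitioned Hive-serde Parquet CTAS. With this rule, the
// write side can be planned as a built-in Parquet data source write instead
// of going through the Hive SerDe.
spark.sql(
  """
    |CREATE TABLE demo_ctas STORED AS parquet AS
    |SELECT id FROM range(10)
  """.stripMargin)
```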
@cloud-fan Updated the comment of this rule, and also added a note about CTAS.
Gentle ping @cloud-fan, GA passed.
Thanks, merging to master!
@AngersZhuuuu Hi, in the case where the inserted directory has the same path as the selected table's location, this may cause an error. https://issues.apache.org/jira/browse/SPARK-38215
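A hedged sketch of the scenario being reported (path and table name are assumed, not taken from the report): the INSERT target directory is the same path as the source table's location, so an overwrite of the directory can clear the files the SELECT is still reading.

```scala
// Sketch of the reported hazard: tbl's LOCATION and the INSERT OVERWRITE
// DIRECTORY target are the same path, so overwriting the directory may wipe
// the data being selected.
spark.sql(
  """
    |CREATE TABLE tbl STORED AS parquet LOCATION '/tmp/same_path' AS
    |SELECT 1 AS id
  """.stripMargin)
spark.sql(
  """
    |INSERT OVERWRITE DIRECTORY '/tmp/same_path'
    |STORED AS parquet
    |SELECT * FROM tbl
  """.stripMargin)
```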
What changes were proposed in this pull request?
Currently, a Spark SQL `INSERT OVERWRITE DIRECTORY ... STORED AS` statement (InsertIntoHiveDirCommand) can't be converted to use InsertIntoDataSourceCommand and still uses the Hive SerDe to write data. Because of this, we can't use features provided by newer Parquet/ORC versions, such as zstd compression. See the sketch below.
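For illustration, a sketch of the kind of write this unlocks once the directory insert goes through the built-in Parquet writer (the output path and source table `src` are made up; `spark.sql.parquet.compression.codec` is the built-in Parquet writer's codec config):

```scala
// Sketch: with the insert converted to a data source write, the built-in
// Parquet writer's codec config applies, so zstd compression can be used.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
spark.sql(
  """
    |INSERT OVERWRITE DIRECTORY '/tmp/zstd_out'
    |STORED AS parquet
    |SELECT * FROM src
  """.stripMargin)
```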
Why are the changes needed?
Converting InsertIntoHiveDirCommand to InsertIntoDataSourceCommand supports more features of Parquet/ORC.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests.